
    Top-Down Induction of Decision Trees: Rigorous Guarantees and Inherent Limitations

    Consider the following heuristic for building a decision tree for a function $f : \{0,1\}^n \to \{\pm 1\}$. Place the most influential variable $x_i$ of $f$ at the root, and recurse on the subfunctions $f_{x_i=0}$ and $f_{x_i=1}$ on the left and right subtrees respectively; terminate once the tree is an $\varepsilon$-approximation of $f$. We analyze the quality of this heuristic, obtaining near-matching upper and lower bounds. Upper bound: For every $f$ with decision tree size $s$ and every $\varepsilon \in (0,\frac{1}{2})$, this heuristic builds a decision tree of size at most $s^{O(\log(s/\varepsilon)\log(1/\varepsilon))}$. Lower bound: For every $\varepsilon \in (0,\frac{1}{2})$ and $s \le 2^{\tilde{O}(\sqrt{n})}$, there is an $f$ with decision tree size $s$ such that this heuristic builds a decision tree of size $s^{\tilde{\Omega}(\log s)}$. We also obtain upper and lower bounds for monotone functions: $s^{O(\sqrt{\log s}/\varepsilon)}$ and $s^{\tilde{\Omega}(\sqrt[4]{\log s})}$ respectively. The lower bound disproves conjectures of Fiat and Pechyony (2004) and Lee (2009). Our upper bounds yield new algorithms for properly learning decision trees under the uniform distribution. We show that these algorithms, which are motivated by widely employed and empirically successful top-down decision tree learning heuristics such as ID3, C4.5, and CART, achieve provable guarantees that compare favorably with those of the current fastest algorithm (Ehrenfeucht and Haussler, 1989). Our lower bounds shed new light on the limitations of these heuristics. Finally, we revisit the classic work of Ehrenfeucht and Haussler. We extend it to give the first uniform-distribution proper learning algorithm that achieves polynomial sample and memory complexity, while matching its state-of-the-art quasipolynomial runtime.
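
    The heuristic described in this abstract can be illustrated with a small brute-force sketch. It is not the authors' implementation: it enumerates all of $\{0,1\}^n$, so it only works for tiny $n$, and the helper names (`completions`, `influence`, `leaf_error`, `build_top_down`) are hypothetical. The abstract does not say which leaf to split next; the sketch assumes the leaf and variable maximizing influence divided by $2^{\text{depth}}$, which is an assumption made here for concreteness.

```python
from itertools import product

def completions(n, restriction):
    """Yield every input in {0,1}^n that agrees with the partial assignment."""
    free = [j for j in range(n) if j not in restriction]
    for bits in product((0, 1), repeat=len(free)):
        x = [0] * n
        for j, b in restriction.items():
            x[j] = b
        for j, b in zip(free, bits):
            x[j] = b
        yield tuple(x)

def influence(f, n, i, restriction):
    """Probability, over uniform completions x, that flipping bit i changes f(x)."""
    xs = list(completions(n, restriction))
    flips = sum(f(x) != f(x[:i] + (1 - x[i],) + x[i + 1:]) for x in xs)
    return flips / len(xs)

def leaf_error(f, n, restriction):
    """Error mass contributed by labelling this leaf with its majority value."""
    xs = list(completions(n, restriction))
    ones = sum(f(x) == 1 for x in xs)
    return min(ones, len(xs) - ones) / 2 ** n

def build_top_down(f, n, eps):
    """Grow a tree (as a list of leaf restrictions) until it eps-approximates f."""
    leaves = [{}]  # the root: no variable queried yet
    while sum(leaf_error(f, n, r) for r in leaves) > eps:
        # Assumed selection rule: highest influence weighted by 2^(-depth).
        _, idx, i = max(
            (influence(f, n, i, r) / 2 ** len(r), idx, i)
            for idx, r in enumerate(leaves)
            for i in range(n) if i not in r
        )
        r = leaves.pop(idx)
        leaves += [{**r, i: 0}, {**r, i: 1}]
    return leaves

# Example target: AND of the first two of three bits, with values in {+1, -1}.
def and2(x):
    return 1 if x[0] and x[1] else -1

tree = build_top_down(and2, 3, 0.0)
```

    Each returned dictionary maps the variables queried on a root-to-leaf path to their values, so `len(tree)` is the number of leaves, the quantity the size bounds in the abstract refer to.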

    Learning Stochastic Decision Trees


    A Strong Composition Theorem for Junta Complexity and the Boosting of Property Testers

    We prove a strong composition theorem for junta complexity and show how such theorems can be used to generically boost the performance of property testers. The $\varepsilon$-approximate junta complexity of a function $f$ is the smallest integer $r$ such that $f$ is $\varepsilon$-close to a function that depends only on $r$ variables. A strong composition theorem states that if $f$ has large $\varepsilon$-approximate junta complexity, then $g \circ f$ has even larger $\varepsilon'$-approximate junta complexity, even for $\varepsilon' \gg \varepsilon$. We develop a fairly complete understanding of this behavior, proving that the junta complexity of $g \circ f$ is characterized by that of $f$ along with the multivariate noise sensitivity of $g$. For the important case of symmetric functions $g$, we relate their multivariate noise sensitivity to the simpler and well-studied case of univariate noise sensitivity. We then show how strong composition theorems yield boosting algorithms for property testers: with a strong composition theorem for any class of functions, a large-distance tester for that class is immediately upgraded into one for small distances. Combining our contributions yields a booster for junta testers, and with it new implications for junta testing. This is the first boosting-type result in property testing, and we hope that the connection to composition theorems adds compelling motivation to the study of both topics. Comment: 44 pages, 1 figure, FOCS 202

    Decision Tree Heuristics Can Fail, Even in the Smoothed Setting

    Greedy decision tree learning heuristics are mainstays of machine learning practice, but theoretical justification for their empirical success remains elusive. In fact, it has long been known that there are simple target functions for which they fail badly (Kearns and Mansour, STOC 1996). Recent work of Brutzkus, Daniely, and Malach (COLT 2020) considered the smoothed analysis model as a possible avenue towards resolving this disconnect. Within the smoothed setting and for targets $f$ that are $k$-juntas, they showed that these heuristics successfully learn $f$ with depth-$k$ decision tree hypotheses. They conjectured that the same guarantee holds more generally for targets that are depth-$k$ decision trees. We provide a counterexample to this conjecture: we construct targets that are depth-$k$ decision trees and show that even in the smoothed setting, these heuristics build trees of depth $2^{\Omega(k)}$ before achieving high accuracy. We also show that the guarantees of Brutzkus et al. cannot extend to the agnostic setting: there are targets that are very close to $k$-juntas, for which these heuristics build trees of depth $2^{\Omega(k)}$ before achieving high accuracy.

    A Query-Optimal Algorithm for Finding Counterfactuals

    We design an algorithm for finding counterfactuals with strong theoretical guarantees on its performance. For any monotone model $f : X^d \to \{0,1\}$ and instance $x^\star$, our algorithm makes $S(f)^{O(\Delta_f(x^\star))} \cdot \log d$ queries to $f$ and returns an optimal counterfactual for $x^\star$: a nearest instance $x'$ to $x^\star$ for which $f(x') \ne f(x^\star)$. Here $S(f)$ is the sensitivity of $f$, a discrete analogue of the Lipschitz constant, and $\Delta_f(x^\star)$ is the distance from $x^\star$ to its nearest counterfactuals. The previous best known query complexity was $d^{\,O(\Delta_f(x^\star))}$, achievable by brute-force local search. We further prove a lower bound of $S(f)^{\Omega(\Delta_f(x^\star))} + \Omega(\log d)$ on the query complexity of any algorithm, thereby showing that the guarantees of our algorithm are essentially optimal. Comment: 22 pages, ICML 202
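
    The brute-force local search baseline mentioned in this abstract can be sketched as follows. This is only the $d^{\,O(\Delta_f(x^\star))}$ baseline, not the paper's query-optimal algorithm. The sketch assumes a binary domain $X = \{0,1\}$ and Hamming distance, neither of which the abstract states; the function name `nearest_counterfactual` and the example model `majority5` are hypothetical.

```python
from itertools import combinations

def nearest_counterfactual(f, x_star):
    """Search Hamming balls of growing radius around x_star for a label change.

    Returns (x_prime, distance), or None if f is constant.
    Assumes binary inputs and Hamming distance (assumptions of this sketch).
    """
    d = len(x_star)
    target = f(x_star)
    for r in range(1, d + 1):                      # radius of the current ball
        for coords in combinations(range(d), r):   # every set of r coordinates
            x = list(x_star)
            for j in coords:
                x[j] ^= 1                          # flip the chosen bits
            if f(tuple(x)) != target:
                return tuple(x), r                 # first hit is a nearest one
    return None

# Example monotone model: majority vote over five bits.
def majority5(x):
    return int(sum(x) >= 3)

result = nearest_counterfactual(majority5, (0, 0, 0, 0, 0))
```

    Because radii are tried in increasing order, the first instance found is at minimum distance; the cost is one query per subset of at most $\Delta_f(x^\star)$ coordinates, which is where the $d^{\,O(\Delta_f(x^\star))}$ bound comes from.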